We will be analyzing a data, which are label-free, bottom-up and generated by data-dependent acquisition. Raw spectra were searched using MaxQuant software, and we will start with the resulting proteinGroups.txt file, containing information about proteins identified and their quantification.
It is an interactomics study of ubiquitin interactors, based on articles of Zhang, et al. 2017 and Zhang et al., 2018.
For the analysis, we will be using DEP package (Differential Enrichment analysis of Proteomics data).
We will be using several libraries:
Install the required packages using install.packages() or BiocManager, don’t forget to comment the installation commands afterwards.
# install.packages("package")
# OR
# library("BiocManager")
# BiocManager::install("package")
library(here)
library(DEP)
library(dplyr)
library(DT)
library(SummarizedExperiment)
On input, there is usually proteinGroups.txt file, generated by MaxQuant software. Read it into R using read.delim() or read.delim2() function.
data <- read.delim(here('data', 'proteinGroups.txt'))
#data <- UbiLength # alternatively, use this command to get the data
nrow(data)
## [1] 3006
We will filter out the most common contaminants:
By change of nrow() you can see how many proteins were filtered out
data <- data %>%
filter(Reverse != "+") %>%
filter(!grepl("cRAP", Majority.protein.IDs)) %>%
filter(Only.identified.by.site != "+") %>%
filter(!grepl("keratin", Fasta.headers)) %>%
filter(!grepl("Keratin", Fasta.headers)) %>%
filter(!grepl("PE=5", Fasta.headers))
nrow(data)
## [1] 2910
Dataset has unique Uniprot identifiers, however those are not immediately informative. The associated gene names are informative, however these are not always unique. For further analysis these proteins must get unique names. Additionally, some proteins do not have an annotated gene name and for those we will use the Uniprot ID.
# Are there any duplicated gene names?
data$Gene.names %>% duplicated() %>% any()
## [1] TRUE
# Make a table of duplicated gene names
data %>% group_by(Gene.names) %>% summarize(frequency = n()) %>%
arrange(desc(frequency)) %>% filter(frequency > 1)
## # A tibble: 46 x 2
## Gene.names frequency
## <chr> <int>
## 1 "" 11
## 2 "ATXN2" 4
## 3 "SF1" 4
## 4 "ATXN2L" 3
## 5 "RBM33" 3
## 6 "UGP2" 3
## 7 "ACTL6A" 2
## 8 "BCLAF1" 2
## 9 "BRAP" 2
## 10 "CALU" 2
## # ... with 36 more rows
# Make unique names using the annotation in the "Gene.names" column as primary names and the annotation in "Protein.IDs" as name for those that do not have an gene name.
data_unique <- make_unique(data, "Gene.names", "Protein.IDs", delim = ";")
We will now create a SummarizedExperiment object and further work with it instead of classical dataframe.
For that, we need the columns containing intensities, and experimental design.
Important: by creating SummarizedExperiment, the protein intensities get log2 transformed! No need for a separate log2 transformation.
# grep the intensity columns
intensity_columns <- grep("LFQ.", colnames(data_unique))
# more often you will be using Intensity instead of LFQ intensity, so grep("Intensity.", colnames(data_unique))
# create experimental design
exp_design <- data.frame(
label = c("Ubi4_1", "Ubi4_2", "Ubi4_3",
"Ubi6_1", "Ubi6_2", "Ubi6_3",
"Ctrl_1", "Ctrl_2", "Ctrl_3",
"Ubi1_1", "Ubi1_2", "Ubi1_3"),
condition = c("Ubi4", "Ubi4", "Ubi4",
"Ubi6", "Ubi6", "Ubi6",
"Ctrl", "Ctrl", "Ctrl",
"Ubi1", "Ubi1", "Ubi1"),
replicate = c(rep(1:3, times = 4))
)
# define the variable types
exp_design$label <- as.character(exp_design$label)
exp_design$condition <- as.character(exp_design$condition)
exp_design$replicate <- as.numeric(exp_design$replicate)
data_se <- make_se(data_unique, intensity_columns, exp_design)
# plot identifications per each condition
data_se
## class: SummarizedExperiment
## dim: 2910 12
## metadata(0):
## assays(1): ''
## rownames(2910): RBM47 UBA6 ... ATXN2.3 X6RHB9
## rowData names(13): Protein.IDs Majority.protein.IDs ... name ID
## colnames(12): Ubi4_1 Ubi4_2 ... Ubi1_2 Ubi1_3
## colData names(4): label ID condition replicate
dim(data_se)
## [1] 2910 12
#dimnames(data_se)
#colData(data_se)
#assay(data_se)
head(assay(data_se))
## Ubi4_1 Ubi4_2 Ubi4_3 Ubi6_1 Ubi6_2 Ubi6_3 Ctrl_1 Ctrl_2
## RBM47 25.09293 24.55807 24.83147 24.75496 24.47172 24.96301 24.62634 24.89851
## UBA6 NA NA NA NA NA NA 22.79170 23.03797
## ILVBL NA NA NA NA NA NA 24.83388 23.92488
## ACO2 24.63423 24.34157 24.55086 23.97132 24.53600 24.67417 24.62673 24.30529
## RPRD1B NA NA NA NA NA NA NA NA
## AAR2 NA NA NA NA NA NA NA NA
## Ctrl_3 Ubi1_1 Ubi1_2 Ubi1_3
## RBM47 24.45989 24.72151 24.94566 24.89027
## UBA6 22.89800 23.24012 23.56232 23.10530
## ILVBL 24.53979 24.66942 24.40756 24.67061
## ACO2 24.27831 24.68212 24.40139 23.78087
## RPRD1B NA NA 21.48871 NA
## AAR2 NA NA 21.75776 NA
#data_se$condition
#data_se@assays@data@listData
# plot identifications per each condition
plot_numbers(data_se) # pokud by napr kontrola mela jen 200 vzorku, tak bych ji tady vyradila a sjela summarised experiment znovu
Proteomics data contain many missing values, therefore we pre-filter the dataset to get rid of the most unreliable identifications.
Firstly, we plot a barplot of the protein identification overlap between samples:
plot_frequency(data_se)
Based on that we can choose the filtering threshold using either of two functions:
Thr means a threshold of allowed missing values in at least one condition. So, thr=1 means one missing value is allowed in at least one condition (2/3 replicates of >=1 condition)
data_filt <- filter_missval(data_se, thr = 1) # thr = 1 nastavuje threshold, kolik validnich hodnot vyzaduju / missing toleruju
plot_numbers(data_filt)
plot_frequency(data_filt)
We need to remove the technical variability using one of normalization approaches. DEP by default uses variance stabilizing normalization (vsn). We will try several more from the limma package:
Vsn normalization
data_norm_vsn <- normalize_vsn(data_filt)
meanSdPlot(data_norm_vsn)
plot_normalization(data_filt, data_norm_vsn)
Quantile normalization
data_norm_quant <- data_filt
assay(data_norm_quant)<-limma::normalizeQuantiles(assay(data_norm_quant))
meanSdPlot(data_norm_quant)
plot_normalization(data_filt, data_norm_quant)
LoessF normalization
data_norm_loess <- data_filt
assay(data_norm_loess)<-limma::normalizeCyclicLoess(assay(data_norm_loess))
meanSdPlot(data_norm_loess)
plot_normalization(data_filt, data_norm_loess)
Median normalization
data_norm_med <- data_filt
assay(data_norm_med)<-limma::normalizeMedianValues(assay(data_norm_med))
meanSdPlot(data_norm_med)
plot_normalization(data_filt, data_norm_med)
We will continue for now with the default, vsn normalization.
Proteomic data contain high number of missing values. There are three different types of missing values:
Based on the type of missingness, we choose appropriate imputation algorithm:
For MNAR:
For MCAR:
For both, MNAR+MCAR: - mixed imputation or use different package, e.g. imp4p
Plot heatmap of missing proteins in at least 1 condition (white = NA; black = present)
plot_missval(data_filt) # zobrazuje pouze proteiny, u kterych je alespon 1 missing hodnota v alespon jedne kondici/replikatu
Plot intensity distributions and cumulative fraction of proteins with and without missing values
plot_detect(data_filt) # cokoli pod cca 20-25 znaci, ze je random
Impute the missing values
data_imp <- impute(data_norm_vsn, fun = "man", shift = 1.8, scale = 0.3)
plot_imputation(data_norm_vsn, data_imp)
Plot PCA for 500 most variable proteins
plot_pca(data_imp, x = 1, y = 2, n = 500, point_size = 4)
# pokryva velkou cast variability (46 a 15%)
# dobre to klastruje, podle kondic
# pokud by klastrovalo dle replikatu, je tam batch, tzn. odstranit a zacit od zacatku
We want to get differentially expressed proteins now, between the bait and control. For that, we will be using limma test.
# Test every sample versus control
data_diff <- test_diff(data_imp, type = "control", control = "Ctrl")
# Or we can manually define contrasts as well:
# Test manually defined comparisons
#data_diff_manual <- test_diff(data_imp, type = "manual",
# test = c("Ubi4_vs_Ctrl", "Ubi6_vs_Ctrl"))
Now, we need to set cutoffs for calling a protein differentially expressed: - logFC -> 1 - adjusted p-value -> 0.05
So, proteins with (logFC > 1 & adj.pvalue < 0.05) will be upregulated, and proteins with (logFC < -1 & adj.pvalue < 0.05) will be downregulated.
We also need to correct for multiple testing, which in DEP is done by fdrtools.
# Denote significant proteins based on user defined cutoffs
dep <- add_rejections(data_diff, alpha = 0.05, lfc = log2(1))
Now let’s see how our differentially expressed proteins look like:
# Plot the Pearson correlation matrix
plot_cor(dep, significant = TRUE, lower = 0, upper = 1, pal = "Reds")
# Plot a heatmap of all significant proteins with the data centered per protein
plot_heatmap(dep, type = "centered", kmeans = TRUE,
k = 6, col_limit = 4, show_row_names = FALSE, # k je pocet clusteru, nejdriv na to kouknout a stanovit pocet
indicate = c("condition", "replicate"))
# Plot a heatmap of all significant proteins (rows) and the tested contrasts (columns)
plot_heatmap(dep, type = "contrast", kmeans = TRUE,
k = 6, col_limit = 10, show_row_names = FALSE)
Ubi1 vs Ctrl
plot_volcano(dep, contrast = "Ubi1_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)
# upregulovane je vpravo - vice exprimovane v Ubi1 nez v kontrole
Ubi4 vs Ctrl
plot_volcano(dep, contrast = "Ubi4_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)
Ubi6 vs Ctrl
plot_volcano(dep, contrast = "Ubi6_vs_Ctrl", label_size = 2, add_names = TRUE, adjusted = TRUE)
We can also plot how proteins of our interest behave across conditions:
plot_single(dep, proteins = "USP15")
plot_single(dep, proteins = "USP15", type = "centered")
Several types of plots can be also generated at once, using the plot_all command:
#plot_all(dep, plots = c("volcano", "heatmap", "single", "freq", "comparison"))
Now, we will generate table with the results of differential expression:
data_results <- get_results(dep)
# We can also get data in wide or long format:
# df_wide <- get_df_wide(dep)
# df_long <- get_df_long(dep)
d.adjust <- data_results
d.adjust$Ubi6_vs_Ctrl_p.val_adj <- p.adjust(d.adjust$Ubi6_vs_Ctrl_p.val, method = "BH")
For better work, we can also generate interactive table:
data_results %>%
select(ID, name, ends_with(c('_p.val', '_p.adj', '_ratio'))) %>%
datatable(extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All"))))
We can also save the workspace for further use:
save(data_se, data_norm_vsn, data_imp, data_diff, dep, file = here('outputs', "data.RData"))
# These data can be loaded in future R sessions using this command
# load("data.RData")
Last, for the reproducibility, we should also save information about which packages we used using the SessionInfo() command:
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250
## [3] LC_MONETARY=Czech_Czechia.1250 LC_NUMERIC=C
## [5] LC_TIME=Czech_Czechia.1250
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SummarizedExperiment_1.22.0 Biobase_2.52.0
## [3] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
## [5] IRanges_2.26.0 S4Vectors_0.30.2
## [7] BiocGenerics_0.38.0 MatrixGenerics_1.4.3
## [9] matrixStats_0.62.0 DT_0.22
## [11] dplyr_1.0.8 DEP_1.14.0
## [13] here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 rjson_0.2.21 ellipsis_0.3.2
## [4] rprojroot_2.0.3 circlize_0.4.14 XVector_0.32.0
## [7] GlobalOptions_0.1.2 clue_0.3-60 rstudioapi_0.13
## [10] hexbin_1.28.2 farver_2.1.0 mzR_2.26.1
## [13] affyio_1.62.0 ggrepel_0.9.1 fansi_1.0.3
## [16] mvtnorm_1.1-3 codetools_0.2-18 ncdf4_1.19
## [19] doParallel_1.0.17 impute_1.66.0 knitr_1.39
## [22] jsonlite_1.8.0 Cairo_1.5-15 cluster_2.1.2
## [25] vsn_3.60.0 png_0.1-7 shinydashboard_0.7.2
## [28] shiny_1.7.1 BiocManager_1.30.17 readr_2.1.2
## [31] compiler_4.1.1 assertthat_0.2.1 Matrix_1.3-4
## [34] fastmap_1.1.0 gmm_1.6-6 limma_3.48.3
## [37] cli_3.2.0 later_1.3.0 htmltools_0.5.2
## [40] tools_4.1.1 gtable_0.3.0 glue_1.6.2
## [43] GenomeInfoDbData_1.2.6 affy_1.70.0 Rcpp_1.0.8.3
## [46] MALDIquant_1.21 jquerylib_0.1.4 vctrs_0.4.1
## [49] preprocessCore_1.54.0 crosstalk_1.2.0 iterators_1.0.14
## [52] tmvtnorm_1.5 xfun_0.29 stringr_1.4.0
## [55] mime_0.12 lifecycle_1.0.1 XML_3.99-0.9
## [58] zlibbioc_1.38.0 MASS_7.3-54 zoo_1.8-10
## [61] scales_1.2.0 MSnbase_2.18.0 pcaMethods_1.84.0
## [64] hms_1.1.1 promises_1.2.0.1 ProtGenerics_1.24.0
## [67] sandwich_3.0-1 RColorBrewer_1.1-3 ComplexHeatmap_2.8.0
## [70] yaml_2.3.5 gridExtra_2.3 ggplot2_3.3.5
## [73] sass_0.4.1 stringi_1.7.6 highr_0.9
## [76] foreach_1.5.2 BiocParallel_1.26.2 shape_1.4.6
## [79] rlang_1.0.2 pkgconfig_2.0.3 bitops_1.0-7
## [82] imputeLCMD_2.0 mzID_1.30.0 evaluate_0.15
## [85] lattice_0.20-44 purrr_0.3.4 labeling_0.4.2
## [88] htmlwidgets_1.5.4 tidyselect_1.1.2 norm_1.0-10.0
## [91] plyr_1.8.7 magrittr_2.0.3 R6_2.5.1
## [94] magick_2.7.3 generics_0.1.2 DelayedArray_0.18.0
## [97] DBI_1.1.2 pillar_1.7.0 MsCoreUtils_1.4.0
## [100] RCurl_1.98-1.6 tibble_3.1.6 crayon_1.5.1
## [103] fdrtool_1.2.17 utf8_1.2.2 tzdb_0.3.0
## [106] rmarkdown_2.14 GetoptLong_1.0.5 grid_4.1.1
## [109] digest_0.6.29 xtable_1.8-4 tidyr_1.2.0
## [112] httpuv_1.6.5 munsell_0.5.0 bslib_0.3.1